1 SUMMARY

The  overall  goal  of  the  X-Graphs  project  was  to  develop  computational  techniques  and 
software  tools  for  graph  analytics.  This  report  describes  the  two  main  components  of  this 
project.  The  first  component  focuses  on  support  for  interactive  graph  analytics 
applications  on  medium  to  large  size  graphs.  The  second  component  focuses  on  support 
for very high performance graph analytics on large to huge sized graphs. The first
component of X-Graphs is SNAP [1]. SNAP provides interactive analytics on graphs
with tens of billions of edges that still fit into the memory of a single multi-CPU server. SNAP
3.0,  the  most  recent  release,  provides  parallel  implementations  for  many  key  graph 
algorithms, conversions between tables and graphs, and Python language bindings. SNAP
is  widely  deployed  with  over  a  thousand  downloads  per  month.  The  second  component  of 
X-Graphs  is  Delite  [18].  Delite  is  a  framework  for  building  compilers  for  high- 
performance  Domain  Specific  Languages  (DSLs)  that  can  be  used  to  target 
heterogeneous  architectures  (multicore,  GPU,  cluster,  FPGA).  OptiGraph  is  a  graph  DSL 
that  is  used  to  develop  graph  analytics  applications  that  achieve  very  high  performance 
on  GPUs  with  small  graphs  (millions  of  edges)  and  also  executes  on  clusters  of  CPUs 
with  huge  graphs  (tens  of  billions  of  edges).  The  Delite  DSL  compiler  technology  is  also 
capable  of  targeting  the  emerging  flexible  accelerator  technology  based  on  FPGAs.  Delite 
is  widely  deployed  and  is  being  used  by  industry  and  academia.  Delite  generated  kernels 
form  the  core  of  the  DeepDive  Knowledge  Construction  System. 


2  INTRODUCTION 

The  goal  of  the  X-Graphs  project  was  to  develop  computational  techniques  and  software 
tools  for  analyzing  massive  dynamically  changing  graphs  for  new  trends,  patterns  and 
relationships.  Graphs  are  a  powerful  way  to  represent  complex  data  relationships  in  a 
compact  and  efficient  fashion.  Over  the  course  of  the  project  the  goals  expanded  to 
encompass  the  development  of  software  for  all  components  of  high-performance  data 
analytics. 

Developing  data  analytics  applications  composed  of  massive  graphs  creates  two  problems 
that  were  solved  by  the  X-Graphs  project: 

• Problem 1: Software abstractions for developing graph analytic applications.
Current graph processing and analysis systems do not work well and are
hard to use because of their complicated APIs. Our solution: Graph Domain
Specific Languages. The objective here is to expedite the implementation of graph
analysis applications via a Graph Domain Specific Language (GDSL). The GDSL
will contain the key components of graph analysis algorithms (abstract data
structures and algorithmic building blocks) as language elements. During this
project, we developed two GDSLs: Snap.py for fast prototyping of graph
algorithms and OptiGraph for high-performance graph analytics.

• Problem 2: High performance platforms for graph processing. Processing graphs
in-memory on a single multicore machine calls for parallel graph algorithms.
Processing graphs on distributed shared-nothing architectures requires effective
graph partitioning and computation. Our solution: Heterogeneous architecture.
The  analyses  of  large  graph  streams  require  large  amounts  of  processing  power 
and  no  single  architecture  will  be  suitable  for  all  problems.  Our  GDSL  compiler 
will  allow  us  to  utilize  and  optimize  graph  analytics  applications  for  several 
different  computational  architectures  and  dynamically  determine  which 
architecture  is  most  suitable  for  the  different  parts  of  the  application.  We  will 
consider: (1) semi-streaming platforms where the data arrives as a continuous
stream  on  a  single  large  memory  machine,  (2)  MapReduce/Pregel  architecture 
where  the  data  is  stored  in  a  distributed  file  system. 


3 METHODS, ASSUMPTIONS, AND PROCEDURES
3.1  Software  Abstractions  for  Graph  Analytic  Applications 

To  simplify  the  development  of  graph  applications  we  have  focused  on  the  development 
of  GDSLs.  We  have  developed  SNAP  (Stanford  Network  Analysis  Platform),  a  graph 
mining  and  analytics  platform,  which  scales  to  massive  graphs.  SNAP  is  already 
available  as  open  source  (http://snap.stanford.edu).  Graph  stream  algorithms  have  been 
integrated with SNAP and made available as part of this platform. We have also been
developing a GDSL based on the Delite platform for high-performance DSL
development. The Delite platform and associated DSLs are available as open source
(http://stanford-ppl.github.com/Delite).

Our  commercialization  plan  takes  advantage  of  Stanford’s  unique  connections  to  Silicon 
Valley.  All  the  team  members  already  have  very  strong  connections  to  a  variety  of  local 
companies,  from  large  to  small.  We  will  continuously  interact  with  these  companies,  as 
well  as  with  venture  capital  firms  that  may  fund  spin  outs.  This  model  has  worked  very 
well for the project PIs in the past (generating companies such as Aster Data and
Google). 


3.2  High  performance  Platforms  for  Graph  Processing  and  Data  Analytics 

The analysis of “big data” requires large amounts of processing power. It is currently a
significant  burden  to  develop  the  best  implementations  of  data  analysis  algorithms  for  all 
the  varieties  of  parallel  architecture  (multicore,  GPU,  cluster,  FPGA).  The  traditional 
library-based  approach  to  this  problem  has  portability  and  versatility  limitations.  Instead, 
we  will  pursue  an  approach  to  developing  data-analysis  applications  based  on  a  suite  of 
domain  specific  languages  (DSLs).  We  have  developed  a  DSL  development  framework 
called Delite to simplify the process of developing high-performance, easy-to-use DSLs.
The  components  of  the  Delite  framework  are  shown  in  Figure  1.  We  have  used  Delite  to 
develop  a  suite  of  DSLs  for  data  analysis  (query  processing,  machine  learning,  and  graph 
processing). 


Approved  for  Public  Release;  Distribution  Unlimited. 



[Figure 1: layered stack, from top to bottom: Applications; DSLs; Delite DSL Framework; Heterogeneous Hardware]

Figure 1. The Delite DSL Framework.


Delite  substantially  reduces  the  burden  of  developing  high-performance  DSLs  by 
providing  common  reusable  components  that  enable  DSLs  to  quickly  and  efficiently 
target  heterogeneous  hardware  [22]. 

4  RESULTS  AND  ACCOMPLISHMENTS 

4.1  Software  Abstractions  for  Graph  Analytic  Applications 

Under  XDATA,  we  made  our  SNAP  network  analysis  software  and  datasets  collection 
into a robust, open source platform. They became a central resource for network analysts,
used  by  thousands  every  month. 

We  developed  SNAP  into  a  powerful  tool  for  network  analysis.  We  expanded  core  SNAP 
with  significant  new  functionality:  built-in  operators  for  relational  tables,  primitives  for 
graph  creation,  high-performance  implementation  of  several  key  algorithms,  and  support 
for  graphs  with  attributes.  We  demonstrated  that  interactive  analysis  of  networks  with 
billions of edges is practical on modern multi-core large-memory machines. We also
provide in SNAP high-performance implementations of novel methods for network
analysis: several methods for detection of overlapping communities, personalized
PageRank, node embeddings into a d-dimensional space, counting of temporal motifs,
and motif-based node clustering. The following new SNAP capabilities have made it
accessible to a wide range of users interested in network analysis: support for Python, a
major programming language for data scientists; documentation; tutorials; and unit tests.
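
To make the idea of personalized PageRank concrete, the following is a minimal pure-Python sketch of power iteration in which all teleport mass returns to a single seed node. It is an illustration only and does not use or reproduce the SNAP API; the function name `personalized_pagerank`, the toy graph, the damping factor, and the iteration count are all assumptions chosen for the example.

```python
def personalized_pagerank(adjacency, seed, damping=0.85, iterations=50):
    """Power iteration where random restarts always jump back to `seed`.

    `adjacency` maps every node to a list of out-neighbors; each neighbor
    must also appear as a key (an assumption of this sketch).
    """
    nodes = list(adjacency)
    rank = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = adjacency[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its damped mass back to the seed.
                nxt[seed] += damping * rank[n]
        # Total rank is 1, so the restart mass is exactly (1 - damping).
        nxt[seed] += 1.0 - damping
        rank = nxt
    return rank

# Hypothetical toy graph; node "d" has no incoming edges.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = personalized_pagerank(toy_graph, seed="a")
```

Nodes that are close to the seed accumulate more rank, while a node that cannot be reached from the seed (here `d`) ends with zero rank.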

The goal of Ringo is to provide an interactive environment for construction,
analysis, and manipulation of graphs on a single large-memory multicore machine. Ringo
is based on Snap.py and SNAP, and uses Python. Ringo now allows the integration of
graphs and tables in a single system and provides powerful primitives for graph
construction.  We  received  a  best  demo  award  at  SIGMOD  2015  for  the  Ringo  project. 
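
The kind of table-to-graph construction described above can be sketched in a few lines of plain Python. The example below connects two values of one column whenever their rows share a value in another column, which is essentially a self-join on the key column. The table contents, the column names, and the helper `table_to_graph` are hypothetical and are not part of the Ringo or Snap.py API.

```python
from collections import defaultdict
from itertools import combinations

def table_to_graph(rows, node_col, key_col):
    """Link two node_col values whenever their rows share a key_col value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_col]].add(row[node_col])
    adjacency = defaultdict(set)
    for members in groups.values():
        for u, v in combinations(sorted(members), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    # Return plain sorted lists so the result is easy to inspect.
    return {node: sorted(nbrs) for node, nbrs in adjacency.items()}

# Hypothetical table: which user posted on which question.
posts = [
    {"user": "ann", "question": 1},
    {"user": "bob", "question": 1},
    {"user": "bob", "question": 2},
    {"user": "cat", "question": 2},
    {"user": "dan", "question": 3},
]
graph = table_to_graph(posts, node_col="user", key_col="question")
```

Users who never share a question with anyone (here `dan`) do not appear in the resulting graph, because this sketch only records nodes that have at least one edge.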

SNAP  network  analysis  software  has  been  downloaded  over  12,000  times  in  the  last  12 
months. Its GitHub repository has 645 stars. The paper on SNAP has 112 citations,
according  to  Google  Scholar.  The  SNAP  dataset  collection  received  around  900,000  Web 
hits  in  the  last  12  months.  It  has  been  cited  800  times,  according  to  Google  Scholar. 

OptiGraph  is  a  DSL,  developed  under  the  Delite  DSL  framework,  for  static  graph 
analysis  based  on  the  Green-Marl  DSL  [Hong  et  al.  2012].  OptiGraph  enables  users  to 
express  graph  analysis  algorithms  using  graph-specific  abstractions  and  automatically 
obtain  efficient  parallel  execution.  OptiGraph  defines  types  for  directed  and  undirected 
graphs,  nodes,  and  edges.  It  allows  data  to  be  associated  with  graph  nodes  and  edges  via 
node  and  edge  property  types  and  provides  three  types  of  collections  for  node  and  edge 
storage  (namely,  Set,  Sequence,  and  Order).  Furthermore,  OptiGraph  defines  constructs 
for  Breadth-First  Search  (BFS)  and  Depth-First  Search  (DFS)  order  graph  traversal, 
sequential  and  explicitly  parallel  iteration,  and  implicitly  parallel  in-place  reductions  and 
group  assignments.  The  SNAP  library  can  be  used  to  generate  OptiML  programs  to  take 
advantage  of  the  high-performance  execution  options  provided  by  the  Delite  DSL 
framework. The OptiGraph publication has been cited over 200 times.
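
As a rough illustration of the abstractions listed above (BFS-order traversal, node properties, and reductions), the sketch below expresses them in plain Python. This is not OptiGraph syntax, and the names `bfs_levels`, `adjacency`, and `level_sizes` are assumptions made for the example; a DSL compiler would be free to parallelize the per-level work, which this sequential version does not attempt.

```python
from collections import deque

def bfs_levels(adjacency, root):
    """Return a node property mapping each reachable node to its BFS depth."""
    depth = {root: 0}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                frontier.append(nbr)
    return depth

# Small directed example graph (hypothetical).
adjacency = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
depth = bfs_levels(adjacency, 0)

# Reduction over the node property: count the nodes at each BFS level.
level_sizes = {}
for d in depth.values():
    level_sizes[d] = level_sizes.get(d, 0) + 1
```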


4.2  High  performance  Platforms  for  Graph  Processing  and  Data  Analytics 

Under X-Graphs we developed the Delite DSL Framework into a platform for performing
large-scale graph analytics, and added new features to improve performance in this
domain. The OptiGraph DSL was the main vehicle for our exploration of graph analytics.
OptiGraph provides a functional programming model for graph analytics and has
demonstrated much better performance than existing state-of-the-art graph processing
frameworks such as GraphLab and PowerGraph.

Graph  Algorithms  -  We  developed  optimized  implementations  of  several  important 
graph algorithms including Betweenness Centrality, Breadth-First Search, PageRank,
Triangle Counting, and Community Detection. We spent considerable effort implementing
the  Louvain  Method  for  community  detection  for  visualizing  graphs.  These  graph 
algorithms  match  similar  functionality  available  in  SNAP. 
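
For readers unfamiliar with these kernels, triangle counting is a convenient small example. The sketch below is a simple set-intersection formulation in plain Python, assuming an undirected graph given as an edge list with comparable node identifiers; it is not the optimized OptiGraph or SNAP implementation, and the function name `count_triangles` is an assumption.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    nbrs = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    total = 0
    for u, adj_u in nbrs.items():
        for v in adj_u:
            if u < v:
                # Only accept a third node w with u < v < w, so each
                # triangle is counted exactly once.
                total += sum(1 for w in adj_u & nbrs[v] if w > v)
    return total

# Hypothetical graph with two triangles: {0, 1, 2} and {0, 2, 3}.
triangle_count = count_triangles([(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)])
```

The inner step is a neighborhood intersection, which is the kind of operation whose cost depends strongly on how neighborhoods are laid out in memory.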

Graph  Layouts  -  Besides  storing  a  graph  in  the  well-known  compressed  sparse  row 
(CSR)  representation,  we  have  investigated  new  ways  of  laying  out  a  graph  in  a  shared 
memory environment to enable performance benefits on certain algorithms. In graph
algorithms, it is often the case that a join query of some nature is required to provide
rapid lookup of a node in a neighborhood. We have developed a platform for OptiGraph
that  parametrically  allows  graph  neighborhoods  as  well  as  computation  frontiers  to  be 
stored  and  computed  in  several  different  graph  data  layouts.  We  have  devised  a 
methodology to decide which algorithmic and dataset dependent factors make one data-layout
better than another. Armed with this methodology, we developed algorithms in the
OptiGraph compiler that automatically choose the optimal data-layout.
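
As a reference point for the layouts discussed above, the following sketch builds a compressed sparse row (CSR) representation in plain Python and answers the "is v in the neighborhood of u" lookup with a binary search over the sorted neighbor slice. The function names `build_csr` and `has_edge` are assumptions for illustration; the alternative layouts explored in OptiGraph are not shown here.

```python
from bisect import bisect_left

def build_csr(num_nodes, edges):
    """Return (offsets, targets); node u's sorted neighbors are
    targets[offsets[u]:offsets[u + 1]]."""
    degree = [0] * num_nodes
    for u, _ in edges:
        degree[u] += 1
    offsets = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + degree[node]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each node
    for u, v in sorted(edges):    # sorting keeps each neighbor slice ordered
        targets[cursor[u]] = v
        cursor[u] += 1
    return offsets, targets

def has_edge(offsets, targets, u, v):
    """Binary search for v inside u's contiguous neighbor slice."""
    lo, hi = offsets[u], offsets[u + 1]
    pos = bisect_left(targets, v, lo, hi)
    return pos < hi and targets[pos] == v

# Hypothetical directed graph with four nodes.
offsets, targets = build_csr(4, [(0, 2), (0, 1), (1, 2), (2, 0), (3, 2)])
```

CSR keeps each neighborhood contiguous, which favors sequential scans; a layout such as a per-node hash set or bitset would trade memory for faster membership tests on dense neighborhoods.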

Graph  Scheduling  -  Delite  has  historically  allowed  for  only  statically  scheduled  parallel 
computations,  but  due  to  the  load  imbalance  performance  losses  that  can  occur  with 
graph  computations  we  extended  the  functionality  of  our  scheduler.  First,  we  are 
implementing  a  dynamic  scheduler  that  will  effectively  provide  work  stealing  across  any 
parallel  loops  in  Delite.  Extending  this  idea  further,  we  have  added  an  asynchronous 
engine  in  Delite  that  allows  graph  computations  to  extend  across  iterations  of  parallel 
computations  without  any  need  for  costly  synchronizations.  Several  existing  graph 
processing  platforms  have  shown  that  doing  this  provides  a  performance  advantage  for 
certain  graph  algorithms. 
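
The load imbalance that motivates dynamic scheduling can be seen in a small simulation. The sketch below compares a static split into equal contiguous chunks against a shared-queue policy in which each idle worker takes the next task. The shared queue is only an approximation of what work stealing achieves, the degree values are hypothetical, and none of this reflects the actual Delite scheduler.

```python
import heapq

def static_makespan(costs, workers):
    """Equal-sized contiguous chunks; finish time is the slowest chunk."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))

def dynamic_makespan(costs, workers):
    """Whichever worker becomes idle first takes the next task."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

# Hypothetical per-node work, proportional to degree in a skewed graph.
degrees = [100, 90, 80, 70, 1, 1, 1, 1, 1, 1, 1, 1]
static_time = static_makespan(degrees, workers=4)
dynamic_time = dynamic_makespan(degrees, workers=4)
```

With the static split, the first worker receives all three of the heaviest nodes, while the dynamic policy spreads them across workers.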

We  made  many  enhancements  to  the  general  Delite  infrastructure,  both  in  terms  of 
making  Delite  DSLs  easier  to  create  and  in  terms  of  more  powerful  optimizations  and 
improved  performance.  The  Delite  compiler  initially  only  had  fixed  phases,  and  DSL 
authors could only perform optimizations as the IR was being constructed. We have now
added support for DSL authors to define additional IR traversals and transformations.
Using  this  support,  we  implemented  powerful  data  structure  optimizations  that  were  not 
feasible  without  this  facility,  including  array-of-struct  to  struct-of-array  transformations 
and  dead  field  elimination.  These  optimizations  are  made  possible  by  introducing  a 
Record  type  as  a  first-class  citizen  in  Delite  that  both  DSL  authors  and  DSL  users  can 
extend  to  make  custom  records/structs.  These  optimizations  combined  with  an  improved 
parallel  loop  fusion  algorithm  and  some  other  existing  optimizations  have  produced  order 
of magnitude speedups across multiple applications [1].
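
The effect of the array-of-struct to struct-of-array transformation, combined with dead field elimination, can be illustrated by hand in plain Python. In Delite these rewrites are applied by the compiler on its IR; the sketch below only shows the before and after data shapes, and the record fields and the helper `aos_to_soa` are hypothetical.

```python
def aos_to_soa(records, live_fields):
    """Turn a list of records into one column per field that is actually read."""
    return {field: [rec[field] for rec in records] for field in live_fields}

# Array of structs: every record carries all of its fields.
particles = [
    {"x": 1.0, "y": 2.0, "label": "a"},
    {"x": 3.0, "y": 4.0, "label": "b"},
    {"x": 5.0, "y": 6.0, "label": "c"},
]

# The computation below reads only x and y, so "label" is a dead field
# and is never copied into the struct-of-arrays form.
columns = aos_to_soa(particles, live_fields=("x", "y"))
dot = sum(x * y for x, y in zip(columns["x"], columns["y"]))
```

Each column is now a contiguous sequence of one type, which is the shape that vector units and GPUs handle well.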

We  developed  support  in  Delite  for  mapping  to  non-CPU  accelerators  such  as  GPUs.  A 
key  element  of  this  expansion  is  our  C++  code  generation  coverage  of  entire  applications. 
This  support  has  many  practical  benefits  including  the  ability  to  generate  vectorized  code, 
generate  NUMA-aware  code  within  a  single  process,  allocate  very  large  arrays,  target 
new  accelerators  such  as  the  Xeon  Phi,  and  communicate  between  the  CPUs  and 
accelerators  with  less  overhead  (i.e.,  copying  memory  in  and  out  of  the  JVM).  To  make 
this  feasible  we  created  our  own  custom  garbage  collection  mechanism  for  Delite 
applications  using  a  combination  of  static  analysis  and  dynamic  reference  counting.  Our 
high-level analysis in Delite provides all the information required to generate
vectorizable loops. We can communicate this information to the C compilers using the
OpenMP SIMD pragma and let the compiler do the detailed work of code generation for
x86 or Xeon Phi vector units. Support for the SIMD pragma will soon be available for
the  GCC  and  LLVM  compilers  with  the  release  of  OpenMP  4.0. 

We  have  complete  support  in  Delite  for  multi-dimensional  mapping  to  hierarchical 
parallelism in GPUs. This support provides a 28.6x speedup over 1D mappings and 9.6x
speedup  over  existing  2D  mappings.  This  support  makes  Delite  generated  code  equal  to 
or  better  than  hand  optimized  CUDA  code. 

Delite is now built around a high-level parallel intermediate language called the
Delite Multiloop Language (DMLL). DMLL can be used to generate efficient code for
various combinations of devices (including C++, CUDA, and OpenCL). DMLL provides
implicitly  parallel  collection  operations  but  lacks  “difficult”  features  such  as  higher  order 
functions,  recursion,  or  an  object  system.  This  enables  high-level  analyses  and 
transformations,  such  as  sophisticated  loop  fusion  and  data  access  optimizations. 
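
Loop fusion over collection operations can be shown with a hand-written before and after pair. The sketch below is plain Python, not DMLL, and the function names are assumptions; in Delite the fused form is derived automatically by the compiler rather than written by the user.

```python
def unfused(values):
    squared = [v * v for v in values]           # map: first temporary list
    evens = [s for s in squared if s % 2 == 0]  # filter: second temporary list
    return sum(evens)                           # reduce: third traversal

def fused(values):
    total = 0
    for v in values:  # single traversal, no intermediate collections
        s = v * v
        if s % 2 == 0:
            total += s
    return total

data = list(range(10))
fused_result = fused(data)
unfused_result = unfused(data)
```

Both versions compute the same value; the fused one avoids allocating and re-reading the temporaries, which is where most of the benefit comes from on memory-bound workloads.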

The  DMLL  has  been  used  to  support  two  new  analysis  and  optimization  techniques  that 
can  dramatically  improve  performance  of  DSL  programs. 

•  Partitioning  Analysis  which  decides  which  data  structures  and  parallel  operations 
to  partition  across  multiple  memory  regions  in  a  distributed  memory  or  NUMA 
architecture. 

•  Nested  pattern  transformations  that  optimize  patterns  for  distributed 
heterogeneous  architectures  to  ensure  data  is  consumed  and  produced 

Using DMLL, Delite now supports NUMA architectures. This support makes it
possible to run machine learning algorithms 10-50 times faster on a 48 core NUMA
architecture compared to well-known analytic environments like Spark and
PowerGraph.  Using  DMLL  we  improved  the  OptiGraph  implementation  of  Gibbs 
sampling  so  that  it  is  now  4x  better  than  DimmWitted.  Gibbs  sampling  is  the  main 
inference  algorithm  used  in  DeepDive.  We  integrated  the  OptiGraph  Gibbs  sampler  into 
DeepDive.  This  involves  replacing  the  hand  coded  C++  Gibbs  sampler  (DimmWitted)  in 
DeepDive  with  the  one  generated  from  OptiGraph  DSL  code.  The  result  of  this 
accomplishment  is  simpler  code  and  much  higher  performance  (4x)  than  the  hand  coded 
C++  version. 
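
For context, Gibbs sampling repeatedly resamples one variable at a time from its conditional distribution given the others. The sketch below runs it on the smallest possible factor graph: two binary variables joined by one factor that rewards agreement with a given weight. The model, the weight, the sweep count, and the function name `gibbs_agreement` are assumptions for illustration and are unrelated to the DeepDive, DimmWitted, or OptiGraph code.

```python
import math
import random

def gibbs_agreement(weight, sweeps, seed=0):
    """Estimate P(x0 == x1) under p(x) proportional to exp(weight * [x0 == x1])."""
    rng = random.Random(seed)
    state = [0, 0]
    agree = 0
    for _ in range(sweeps):
        for i in (0, 1):
            other = state[1 - i]
            # Unnormalized scores for x_i = 1 and x_i = 0 given the other variable.
            score_one = math.exp(weight * (1 == other))
            score_zero = math.exp(weight * (0 == other))
            p_one = score_one / (score_one + score_zero)
            state[i] = 1 if rng.random() < p_one else 0
        agree += state[0] == state[1]
    return agree / sweeps

estimate = gibbs_agreement(weight=2.0, sweeps=20000)
# Closed form for this two-variable model: e^w / (e^w + 1).
expected = math.exp(2.0) / (math.exp(2.0) + 1.0)
```

On a real factor graph the inner loop visits millions of variables per sweep, so memory layout and NUMA placement dominate the running time.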

Towards  the  end  of  the  X-Graphs  project  we  explored  using  Delite  to  automatically 
generate  hardware  for  DSL  applications  by  adding  the  compiler  analyses  and 
transformations  necessary  to  create  implementations  that  are  far  more  sophisticated  than 
traditional  C-to-Verilog  systems.  To  more  precisely  reason  about  data  structure  usage,  we 
added  first-class  multidimensional  arrays  to  Delite.  With  this  abstraction,  we  can  analyze 
data  access  patterns  in  a  much  more  straightforward  way  than  mapping  everything  to  flat 
arrays. With this new information, we implemented automatic tiling transformations on
nested  parallel  patterns  to  greatly  improve  locality  by  storing  reused  data  in  local 
memories.  While  this  transformation  is  applicable  to  many  hardware  architectures,  we 
found  it  particularly  essential  for  targeting  FPGAs  which  lack  built-in  caches  and 
memory  pre-fetchers.  We  then  added  code  generators  from  Delite  parallel  patterns  to 
HDL  modules.  In  addition  to  simply  mapping  each  tiled  Delite  parallel  pattern  to  a 
hardware  design,  we  also  automatically  inserted  fine-grained  hierarchical  pipelining 
(metapipelining)  across  Delite  operations  and  at  each  nesting  level  to  greatly  improve 
design  throughput. 
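
The tiling transformation can be illustrated on matrix multiplication. The sketch below is a hand-tiled version in plain Python that processes one block of each operand at a time; it only demonstrates the loop structure and does not generate hardware. The function name `matmul_tiled` and the tile size are assumptions, and in Delite the equivalent rewrite is applied automatically to nested parallel patterns.

```python
def matmul_tiled(a, b, tile):
    """Multiply a (n x k) by b (k x m), visiting the iteration space tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Work confined to one tile of a, b, and c; on an FPGA these
                # blocks would be held in on-chip memory and reused.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
tiled = matmul_tiled(a, b, tile=2)
# Untiled reference for comparison.
reference = [[sum(a[i][x] * b[x][j] for x in range(len(b)))
              for j in range(len(b[0]))] for i in range(len(a))]
```

The result is identical to the untiled loop nest; only the order in which elements are touched changes, which is what improves locality.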

We developed a design framework based on a new Delite representation of hardware
that uses parameterized templates to capture locality and parallelism information at
multiple levels of nesting. This representation, called Spatial, is designed to be
automatically generated from high-level intermediate languages based on parallel
patterns such as DMLL. We have developed a hybrid area estimation technique which
uses template-level models and design-level artificial neural networks to account for
effects from hardware place-and-route tools, including routing overheads, register and
block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip
memory  accesses.  We  use  our  estimation  capabilities  to  rapidly  explore  a  large  space  of 
designs  across  tile  sizes,  parallelization  factors,  and  optional  coarse-grained  pipelining, 
all  at  multiple  loop  levels.  We  show  that  estimates  average  4.8%  error  for  logic  resources, 
6.1%  error  for  runtimes,  and  are  279  to  6533  times  faster  than  a  commercial  high-level 
synthesis  tool.  We  compare  the  best-performing  designs  to  optimized  CPU  code  running 
on a server-grade 6 core processor and show speedups of up to 16.7x.
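
The general shape of estimate-driven design space exploration can be sketched as follows. Everything numeric here is a toy: the cost model in `estimate`, the candidate tile sizes and parallelization factors, and the area budget are hypothetical and bear no relation to the template-level models or neural networks described above. The point is only that fast estimates make exhaustive enumeration affordable.

```python
from itertools import product

def estimate(tile, par, n=1024):
    """Toy analytical model (hypothetical): returns (cycles, area_units)."""
    compute_cycles = (n * n) // par   # more parallel lanes, fewer compute cycles
    memory_cycles = (n * n) // tile   # larger tiles, fewer off-chip transfers
    area = tile * 2 + par * 10        # both knobs consume on-chip resources
    return compute_cycles + memory_cycles, area

def explore(tiles, pars, area_budget):
    """Return the fastest (cycles, area, tile, par) that fits the budget."""
    feasible = []
    for tile, par in product(tiles, pars):
        cycles, area = estimate(tile, par)
        if area <= area_budget:
            feasible.append((cycles, area, tile, par))
    return min(feasible) if feasible else None

best = explore(tiles=(16, 32, 64), pars=(1, 2, 4, 8), area_budget=150)
```

A real explorer would replace `estimate` with calibrated models and would search many more parameters, including pipelining choices at each loop level.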

We have developed the Spatial IR into a new full-fledged performance-oriented
programming  language  called  Spatial.  Spatial  is  intended  for  performance  engineers  who 
care  about  optimizing  the  performance  of  applications  by  specifying  parallelism  and 
locality.  The  language  is  especially  valuable  for  the  emerging  class  of  configurable 
accelerator architectures such as field-programmable gate arrays (FPGAs) and
coarse-grained reconfigurable architectures (CGRAs) that are being used to accelerate applications
in data analytics and machine learning. The Spatial compiler uses techniques developed
in the Delite compiler to translate DSL applications to hardware using the Chisel HDL.
We have demonstrated a 2-5 times performance improvement compared to
existing FPGA compilers that convert C to hardware using high-level synthesis. The
growing use of FPGA based accelerators for machine-learning and data processing by
companies such as Microsoft and Amazon suggests that the Spatial compiler technology
will  play  a  key  role  in  the  data-analytics  accelerator  market. 

The  Delite  DSL  Framework  and  associated  DSLs  (OptiGraph,  OptiML,  and  OptiQL)  are 
available  as  open  source  on  GitHub.  The  Delite  technology  is  broadly  used  by  academia 
and  industry.  Industrial  use  includes  Oracle,  Intel  and  a  few  start-up  companies. 

5  CONCLUSIONS 

The  overall  goal  of  the  X-Graphs  project  was  to  develop  computational  techniques  and 
software  tools  for  graph  analytics.  We  were  successful  in  achieving  this  goal  with  the 
development of two key software technologies: SNAP, a system designed for interactive
analytics on large graphs that fit into the memory of a single server, and Delite, a DSL
compiler framework for optimizing DSLs to heterogeneous computer architectures.

Taken together, these technologies allow a data scientist to interactively
develop an analytics application using SNAP and then deploy it using Delite with the
goal of achieving much higher performance with much larger dataset sizes. Both SNAP
and  Delite  are  widely-used  open  source  software  available  on  GitHub.  The  X-Graphs 
project  has  supported  the  generation  of  over  thirty  peer-reviewed  technical  publications. 


6  PUBLICATIONS  SUPPORTED  BY  X-GRAPHS 

1. J. Leskovec and R. Sosic, “Snap: A general-purpose network analysis and
graph-mining library,” ACM Transactions on Intelligent Systems and Technology
(TIST), vol. 8, no. 1, July 2016.

3. C. De Sa, K. Olukotun, and C. Re, “Ensuring Rapid Mixing and Low Bias for
Conference on Machine Learning, June 2016. (Best Paper Award)

4. D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K.
Symposium on Code Generation and Optimization, March 2016.
Analytics Acceleration,” Proceedings of the 2016 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, February 21-23, 2016.

8.  C.  De  Sa,  C.  Zhang,  C.  Re,  and  K.  Olukotun,  “Taming  the  Wild:  A  Unified 
Analysis of Hogwild!-Style Algorithms,” NIPS '15: Proceedings of the 28th
Neural  Information  Processing  Systems  Conference,  December  2015. 

9.  C.  De  Sa,  C.  Zhang,  C.  Re,  and  K.  Olukotun,  “Rapidly  Mixing  Gibbs  Sampling 
for  a  Class  of  Factor  Graphs  Using  Hierarchy  Width,”  NIPS  '15:  Proceedings  of 
the  28th  Neural  Information  Processing  Systems  Conference,  December  2015. 

10. N. George, H. Lee, D. Novo, M. Owaida, D. Andrews, K. Olukotun and P. Ienne,

11. C. De Sa, K. Olukotun, and C. Re, “Global Convergence of Stochastic Gradient
Descent for Some Non-convex Matrix Problems,” ICML '15: Proceedings of the
32nd Intl. Conference on Machine Learning, July 2015.

12. R. Sosic, A. Banerjee, R. Puttagunta, M. Raison, P. Shah, and J. Leskovec,
“Ringo: Interactive Graph Analytics on Big-Memory Machines,” Proceedings of
the 2015 ACM International Conference on Management of Data (SIGMOD),
June 2015.

13. T. Rompf, K. J. Brown, H. Lee, A. K. Sujeeth, K. Olukotun, “Go Meta! A Case
for Generative Programming and DSLs in Performance Critical Systems,”
SNAPL 2015: Symposium on Advances in Programming Languages, May 3-4,
2015.

14.  L.  McAfee  and  K.  Olukotun,  “EMEURO:  A  Framework  for  Generating  Multi- 
Purpose  Accelerators  via  Deep  Learning,”  CGO  ’15:  International  Symposium  on 
Code  Generation  and  Optimization,  San  Francisco,  CA,  Feb.  10,  2015. 

15. K. Olukotun and L. Hammond, “Author's retrospective for: Improving the
International Symposium on Microarchitecture, MICRO 2014, Cambridge, United
Kingdom, December 13-17, 2014.